intrinsic reward
A Mixture of Surprises for Unsupervised Reinforcement Learning
Unsupervised reinforcement learning aims at learning a generalist policy in a reward-free manner for fast adaptation to downstream tasks. Most of the existing methods propose to provide an intrinsic reward based on surprise. Maximizing or minimizing surprise drives the agent to either explore or gain control over its environment. However, both strategies rely on a strong assumption: the entropy of the environment's dynamics is either high or low. This assumption may not always hold in real-world scenarios, where the entropy of the environment's dynamics may be unknown.
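As a rough illustration of the surprise-based intrinsic reward described above, the sketch below measures surprise as the negative log-probability of a state under a learned density model, and flips its sign depending on whether surprise is to be maximized (exploration) or minimized (control). The `CountDensity` class, its add-one smoothing, and the discrete state space are assumptions made for this example only; they are not taken from the paper.

```python
import math
from collections import Counter


class CountDensity:
    """Hypothetical count-based state density with add-one smoothing."""

    def __init__(self, num_states):
        self.num_states = num_states
        self.counts = Counter()
        self.total = 0

    def prob(self, state):
        # Smoothed visitation frequency; unseen states keep non-zero mass.
        return (self.counts[state] + 1) / (self.total + self.num_states)

    def update(self, state):
        self.counts[state] += 1
        self.total += 1


def surprise_reward(model, state, maximize=True):
    """Intrinsic reward from surprise = -log p(state).

    maximize=True rewards rarely visited states (exploration);
    maximize=False rewards familiar states (control).
    """
    surprise = -math.log(model.prob(state))
    return surprise if maximize else -surprise


model = CountDensity(num_states=4)
for _ in range(6):
    model.update(0)  # state 0 becomes familiar, states 1-3 stay novel

novel = surprise_reward(model, 1)      # rarely seen state
familiar = surprise_reward(model, 0)   # frequently seen state
```

With these counts the novel state receives the larger reward when surprise is maximized, and the ordering reverses when `maximize=False`, which is the choice between the two strategies that the abstract says depends on the entropy of the environment's dynamics.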
analysis of Algorithm
In this section, we provide a convergence rate analysis for Algorithm 1. Similar to Hazan et al. [36], Algorithm 1 has access to an approximate density oracle and an approximate planner, defined below. Visitation density oracle: We assume access to an approximate density estimator that takes a policy $\pi$ and a density approximation error $\varepsilon_d > 0$ as inputs and returns $\hat{d}_\pi$ such that $\|d_\pi - \hat{d}_\pi\|_1 \le \varepsilon_d$. Approximate planning oracle: We assume access to an approximate planner that, given any MDP $M$ and error tolerance $\varepsilon_p > 0$, returns a policy $\pi$ such that $J_M(\pi) \ge \max_{\pi'} J_M(\pi') - \varepsilon_p$. A.1 Proof of Theorem 1. We first give the following proposition that captures certain properties of the proposed objective. The proof is postponed to the end of this section. Taking the above proposition as given for the moment, we prove Theorem 1 following steps similar to those of Hazan et al. [36, Theorem 4.1]. Since $\pi_k$ returned by the approximate planning oracle is an $\varepsilon_p$-optimal policy in $M_k$, we have $(1-\gamma)^{-1}\langle d_{\pi_k}, r_k\rangle \ge (1-\gamma)^{-1}\langle d_\pi, r_k\rangle - \varepsilon_p$ for any policy $\pi$, including $\pi^\star$. Therefore, It is straightforward to check that setting 0.1 1, p 0.1, d 0.1 1, 0.1, and the number of iterations K 1 log(10B 1) yields the claim of Theorem 1. Remark 2. Since the temperature parameter in Proposition 1 goes to zero as $k$ increases, one can show that the expected value of the policy returned by Algorithm 1 converges to the maximum performance $J(\pi^\star)$.
Exploration-Guided Reward Shaping for Reinforcement Learning under Sparse Rewards
We study the problem of reward shaping to accelerate the training process of a reinforcement learning agent. Existing works have considered a number of different reward shaping formulations; however, they either require external domain knowledge or fail in environments with extremely sparse rewards. In this paper, we propose a novel framework, Exploration-Guided Reward Shaping (EXPLORS), that operates in a fully self-supervised manner and can accelerate an agent's learning even in sparse-reward environments. The key idea of EXPLORS is to learn an intrinsic reward function in combination with exploration-based bonuses to maximize the agent's utility w.r.t.
To facilitate the following derivation, we rewrite the objective J E+I(E+I) JE(E): J E+I(E+I) JE(E) = E E+ I h 1X
A.1 Full derivation. We present the complete derivation of the objective function in each subproblem defined in Section 3.2. For brevity, let rt =(1+)rEt +rIt and V EE (st)= Vt. Under this assumption, E serves as 0 (see above). This enables updating E+I using the local approximation. We leave relaxing this assumption as future work.
Episodic Multi-agent Reinforcement Learning with Curiosity-driven Exploration
Efficient exploration in deep cooperative multi-agent reinforcement learning (MARL) remains challenging in complex coordination problems. In this paper, we introduce a novel Episodic Multi-agent reinforcement learning with Curiosity-driven exploration, called EMC. We leverage an insight of popular factorized MARL algorithms that the "induced" individual Q-values, i.e., the individual utility functions used for local execution, are the embeddings of local action-observation histories, and can capture the interaction between agents due to reward backpropagation during centralized training. Therefore, we use prediction errors of individual Q-values as intrinsic rewards for coordinated exploration and utilize episodic memory to exploit explored informative experience to boost policy training. As the dynamics of an agent's individual Q-value function capture the novelty of states and the influence from other agents, our intrinsic reward can induce coordinated exploration to new or promising states. We illustrate the advantages of our method with didactic examples, and demonstrate that it significantly outperforms state-of-the-art MARL baselines on challenging tasks in the StarCraft II micromanagement benchmark.
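The two ingredients named in the abstract, an intrinsic reward built from prediction errors of individual Q-values and an episodic memory of informative experience, can be sketched as follows. This is a minimal illustration under stated assumptions, not EMC's actual implementation: the function `curiosity_reward`, the choice of averaging per-agent L2 errors, and the `EpisodicMemory` class keyed by a hashable state are all hypothetical names and design choices introduced here.

```python
import math


def curiosity_reward(predicted_q, target_q):
    """Mean over agents of the L2 error between predicted and target
    individual Q-value vectors (one vector of action values per agent).

    Assumption: averaging per-agent L2 norms is an illustrative choice.
    """
    errors = []
    for pred, tgt in zip(predicted_q, target_q):
        squared = sum((p - t) ** 2 for p, t in zip(pred, tgt))
        errors.append(math.sqrt(squared))
    return sum(errors) / len(errors)


class EpisodicMemory:
    """Hypothetical table keeping the best return observed per state key."""

    def __init__(self):
        self.best_return = {}

    def update(self, state_key, episode_return):
        old = self.best_return.get(state_key, float("-inf"))
        self.best_return[state_key] = max(old, episode_return)

    def lookup(self, state_key, default=0.0):
        return self.best_return.get(state_key, default)


# Two agents, two actions each: the first agent is predicted perfectly,
# the second is off by (3, 4), giving an L2 error of 5.
reward = curiosity_reward([[1.0, 2.0], [3.0, 4.0]], [[1.0, 2.0], [0.0, 0.0]])

memory = EpisodicMemory()
memory.update("s0", 1.0)
memory.update("s0", 0.5)  # a worse return does not overwrite the stored one
```

In this sketch a large prediction error marks a joint situation whose individual utilities are still poorly modelled, so it yields a larger exploration bonus, while the memory retains the highest return seen from a state for later use as a training target.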